Abstract

We describe the WHY2-ATLAS intelligent tutoring system for
qualitative physics that interacts with students via natural language
dialogue. We focus on the issue of analyzing and responding to
multi-sentential explanations. We explore approaches for achieving a
deeper understanding of these explanations and dialogue management
approaches and strategies for providing appropriate feedback on them.

Introduction 

In a tutorial system that interacts with a student through natural
language, the system needs to understand the user just well enough to
respond appropriately. What it means to understand well enough and
what it means to respond appropriately vary according to the
application.

Most natural language tutorial applications have focused on coaching
either problem solving or procedural knowledge (e.g., Steve (Johnson
& Rickel 1997), Circsim-tutor (Evens et al. 2001), BEETLE (Zinn,
Moore, & Core 2002), SCoT (Pon-Barry et al. 2004), inter alia). When
coaching problem solving, simple short answer analysis techniques are
frequently sufficient because the primary goal is to lead a trainee
step-by-step through problem solving. There is a narrow range of
possible responses, and the context of the previous dialogue and the
question invite a short answer.

Any deeper analysis of short answers in these cases results in a
small return on investment when the focus is eliciting a step during
problem solving. It isn’t until the instructional objectives shift
and a tutorial system attempts to explore a student’s chain of
reasoning behind an answer or decision that deeper analysis can begin
to pay off. Having the student construct more on his own is important
for learning, perhaps in part because he reveals what he does and
does not understand (Chi et al. 2001). But the difficulty in
understanding the explanation increases with the length of the chain
of reasoning being elicited. If just one step in the reasoning is
sought, then only deeper single sentence analysis is needed. This was
the case with the GEOMETRY EXPLANATION TUTOR (Aleven et al. 2003).
Since all the reasons sought were definitions, terminological
classification was a good fit for understanding well enough to
respond appropriately.

When the student is invited to provide a longer chain of reasoning,
the explanations become multi-sentential. Compare the short
explanations requested in Figure 1 to the longer ones in Figures 2
and 3. The explanation in Figure 2 is part of an initial student
response, and Figure 3 shows the explanation from the same student
after several follow-up dialogues with the WHY2-ATLAS tutoring
system. A longer explanation is unlikely to strictly follow the
problem solving structure because the student may reorganize it (e.g.
give an overview before going into details) and may leave out some of
the reasoning, both of which are common in natural language.


GEOMETRY  EXPLANATION  TUTOR:  Base  angles  in  what  type 
of  geometric  figure  are  congruent 

Student: the bottom angles in an isoceles triangle are congruent
< approximately 3 propositions expressed> (Aleven et al. 2003)

WHY2-AUTOTUTOR:  Once  again,  how  does  Newton’s  third  law 
of  motion  apply  to  this  situation? 

Student:  Does  Newton’s  law  apply  to  opposite  forces? 

< approximately  2  propositions  expressed>  (Graesser  et  al. 
2005). 

WHY2-ATLAS:  Fine.  Using  this  principle,  what  is  the  value  of 
the  horizontal  component  of  the  acceleration  of  the  egg?  Please 
explain  your  reasoning. 

Student:  zero  because  there  is  no  horizontal  force  acting  on  the 
egg  < approximately  3  propositions  expressed> 


Figure 1: Examples of one-sentence explanations from the domains of
geometry and qualitative physics.

The only previous tutoring system that has attempted to address
longer explanations is AUTOTUTOR (Graesser et al. 2005). It uses a
latent semantic analysis (LSA) approach where the structure of
sentences is not considered. Thus the degree to which details of the
explanation are understood is limited. But this approach is
appropriate given AUTOTUTOR’s pedagogical strategy of eliciting a
single unit of the explanation (about one sentence or more) when LSA
determines it is missing. It first hints with a short answer question
and, if that fails, prompts with a fill-in-the-blank question and, if
that fails, bottoms out with the missing unit. One way to possibly
improve is to add pedagogical strategies that elicit increasingly
greater precision as students’ explanations become less vague (e.g.
“what can you say about the forces in this problem?”, “you are right
that the net force is zero but how did you determine this?”). But to
do so, deeper understanding of multi-sentential explanations is
likely necessary (Chi et al. 2001).


Question: Suppose a man is in an elevator that is falling without
anything touching it (ignore the air, too). He holds his keys
motionless right in front of his face and then just releases his
grip on them. What will happen to them? Explain.

<omitted approximately 15 correct propositions>... Yet the
gravitational pull on the man and the elevator is greater because
they are of a greater weight and therefore they will fall faster then
the keys. I believe that the keys will float up to the ceiling as the
elevator continues falling.


Figure 2: Part of a verbatim student response to the stated
problem before interacting with the tutoring system.


<omitted approximately 16 correct propositions>... Since <Net
force = mass * acceleration> and <F= mass*g> therefore
<mass*acceleration= mass*g> and acceleration and gravitational
force end up being equal. So mass does not effect anything
in this problem and the acceleration of both the keys
and the man are the same. <omitted approximately 46 correct
propositions>...we can say that the keys will remain right in
front of the man’s face.


Figure 3: Part of a verbatim response from the same student
in Figure 2 after completing interaction with the system.

In this paper we will describe the WHY2-ATLAS qualitative physics
tutoring system’s approach for supporting a wider range of
pedagogical strategies and for achieving a deeper understanding. We
will end with a discussion of the system’s most recent evaluation in
which student learning gains were measured. Although the results are
promising, much work remains to be done to assess interactions
between the system’s understanding performance and learning.

Dialogue Management in Why2-Atlas

Lower-level dialogue management. At the lowest level, dialogue
management is a finite state network with a stack that is implemented
using a reactive planner (APE (Freedman 2000)). Finite state
approaches are appropriate for dialogues in which the task to be
discussed is well-structured and the dialogue is to be system-led
(McTear 2002), as was the case for WHY2-ATLAS.

A  state  in  the  network  is  either  a  push  to  a  sub-network  as 
with  the  right-most  and  left-most  nodes  in  Figure  4  or  a  tutor 
turn  plus  an  optional  student  response  as  with  the  top  node 
and  its  three  branches  in  Figure  4.  There  is  a  sub-network 
for  each  complex  topic  to  discuss  in  dialogue  so  that  a  state 
is  the  equivalent  of  a  step  in  a  recipe  for  covering  the  topic. 


[Figure 4 diagram. Node text recovered from the diagram: “As a runner
pushes a ball away, what horizontal forces act on it?” and “After the
push ends, what forces . . . . ?”]

Figure 4: Finite State Model with answer classes and optional steps.


A tutor turn is a ready-to-utter string. When a tutor turn sets up a
discourse obligation for the student (e.g. tutor asks a question as
with the top node in Figure 4), there is a set of anticipated classes
to recognize for each conceptually different satisfactory and
unsatisfactory response. The classification of the student response
decides the next state to which to move. Thus each response selects
an arc between two states in the network. Classes that correspond to
unsatisfactory responses lead to a state that is a push to a recipe
that addresses the unsatisfactory response. These remediation recipes
are written to anticipate an eventual return to a state that is the
next step in the parent recipe. By default, if a tutor turn does not
set up an obligation for the student to respond, then the transition
is to the next step in the recipe.
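The network-with-a-stack behaviour described above can be illustrated with a minimal Python sketch. It is not the APE planner or the WHY2-ATLAS scripting language: the `State` class, the recipe names, the answer-class labels, and the placeholder tutor turns marked "hypothetical" are all assumptions made for illustration. Only the first question is taken from Figure 4.

```python
from dataclasses import dataclass, field

@dataclass
class State:
    tutor_turn: str                            # ready-to-utter string
    arcs: dict = field(default_factory=dict)   # answer class -> recipe to push (None = next step)

# Hypothetical recipes; each list is a sequence of steps.
RECIPES = {
    "main": [
        State("As a runner pushes a ball away, what horizontal forces act on it?",
              arcs={"correct": None, "expected_wrong": "remediate"}),
        State("(hypothetical next question in the parent recipe)",
              arcs={"correct": None}),
    ],
    "remediate": [
        State("(hypothetical remediation question)", arcs={"correct": None}),
    ],
}

def run(recipe, classify, answers):
    """Walk the network and return the tutor turns in the order uttered."""
    stack = [(recipe, 0)]          # (recipe name, index of next step)
    answers = iter(answers)
    uttered = []
    while stack:
        name, i = stack.pop()
        if i >= len(RECIPES[name]):
            continue               # recipe finished: pop back to the parent
        state = RECIPES[name][i]
        stack.append((name, i + 1))            # return point is the next step
        uttered.append(state.tutor_turn)
        if state.arcs:                         # the turn set up an obligation to respond
            target = state.arcs.get(classify(next(answers)))
            if target:                         # unsatisfactory class: push a remediation
                stack.append((target, 0))
    return uttered
```

With a classifier that maps one answer to `expected_wrong`, the remediation question is uttered between the two parent steps, and the walk then resumes at the next step of the parent recipe.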

The anticipated student response classes for each state are further
categorized as either correct answers, vague answers, expected wrong
answers or unanticipated responses. This categorization of the answer
classes helps determine feedback (e.g. “Correct!”), which is
prepended to the ready-to-utter strings in the network, and helps in
tracking the student’s performance over time when analyzing the
dialogue history.

Different classification techniques can be designated for each state.
The default classification technique is short-answer classification
since a majority of responses are still anticipated to be short
answers. But when the response for a state is expected to be an
explanation, the explanation classifier is designated for that state.
Both classification approaches will be described in more detail later
in the paper.

In addition to answer classes, three other conditions can be used in
deciding which state to go to next. One is a test to skip a state if
the content of that state is already in the discourse history, as
with the “said” and “not said” arcs in Figure 4. The second
transition condition is a test of which difficulty level is
appropriate for a student. For example, there could be an alternate
state relative to the last node in Figure 4, and the two alternate
states could have different difficulty levels associated with them.
The past performance of the student is evaluated to determine which
is the appropriate one to select. The last transition condition
applies just before a pop from a remediation sub-network and tests
that the state before the push is still in the student’s focus of
attention according to the dialogue history. If it is, the pop is
completed; if it is not, the tutor turn before the push is repeated.
In the latter case part of the original network is copied and
inserted just before the pop; just the correct and the unanticipated
response conditions and transitions are copied. But the path for the
unanticipated response instead leads to a tutor turn that states the
correct answer just before the pop is completed.

Higher-level dialogue management. This level of dialogue management
oversees the finite state network and picks between three types of
recipes that were authored for WHY2-ATLAS: (1) a high-level
walkthrough of the problem solution or parts of the problem solution,
(2) short elicitations of particular pieces of knowledge and (3)
remediations. Walkthrough recipes are selected when the student is
unable to provide much in direct response to the qualitative physics
problem or when the system is unable to classify much of what the
student wrote. Short elicitations are selected if the student’s
response is partially complete with a few scattered gaps, in order to
encourage the student to fill in missing pieces of the explanation.
Remediations are selected if errors or misconceptions are detected in
the response. While executing a recipe, pushes to recipes for
subdialogues that are of the same three types (i.e. walkthrough,
elicitation or remediation) are possible but typically are limited to
remediations.
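The choice among the three recipe types can be sketched as a small decision function. The text does not give the order in which the conditions are tested or any numeric threshold, so the priority given to remediation and the `min_coverage` cutoff below are assumptions, as are the function and argument names.

```python
def select_recipe(expected, matched_correct, matched_buggy, min_coverage=0.5):
    """Pick a recipe type from the result of response analysis.

    expected        -- set of nodes a complete explanation should cover
    matched_correct -- set of correct nodes recognized in the response
    matched_buggy   -- set of error/misconception nodes recognized
    min_coverage    -- assumed cutoff below which the student has "not provided much"
    """
    if matched_buggy:
        return "remediation"       # errors or misconceptions were detected
    coverage = len(expected & matched_correct) / len(expected)
    if coverage < min_coverage:
        return "walkthrough"       # little was said, or little could be classified
    if coverage < 1.0:
        return "elicitation"       # mostly complete, a few scattered gaps
    return None                    # nothing left to discuss
```

A response covering eight of ten expected nodes with no detected errors would, under these assumptions, lead to short elicitations for the two missing nodes.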

In the case of single elicitation recipes, the dialogue manager will
present a summary of what is correctly covered according to the
response analysis. The content selected for the summary includes all
nodes in a solution graph that are on the path between the node that
is to be elicited and the first node that is in focus in the dialogue
history (i.e. what was last talked about in dialogue). The summaries
are generated using templates with clause slots, and clauses
associated with the selected nodes of the graph fill those slots.
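One way to picture this content selection is a path search over the solution graph followed by template filling. The sketch below is an illustration only: the graph fragment, node names, clause wording (adapted from steps 10 to 12 of Figure 5), the template string, and the decision to exclude the node being elicited from the summary are all assumptions.

```python
from collections import deque

# Hypothetical fragment of a solution graph: premise -> conclusions.
GRAPH = {
    "same_final_v": ["same_avg_v"],
    "same_avg_v": ["same_disp"],
    "same_disp": ["same_pos"],
}
# Clause associated with each node, used to fill the template slot.
CLAUSES = {
    "same_final_v": "the keys and the man have the same final velocity",
    "same_avg_v": "they have the same average velocity while falling",
    "same_disp": "they have the same displacements at all times",
}

def path_between(graph, start, goal):
    """Breadth-first search; return the nodes from start to goal, or []."""
    parents = {start: None}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if node == goal:
            path = []
            while node is not None:
                path.append(node)
                node = parents[node]
            return path[::-1]
        for nxt in graph.get(node, ()):
            if nxt not in parents:
                parents[nxt] = node
                queue.append(nxt)
    return []

def summarize(in_focus, to_elicit, template="So far you have said that {}."):
    """Summarize the nodes leading up to, but not including, the elicited node."""
    covered = path_between(GRAPH, in_focus, to_elicit)[:-1]
    if not covered:
        return ""
    return template.format(", and that ".join(CLAUSES[n] for n in covered))
```

Calling `summarize("same_final_v", "same_disp")` yields a one-sentence recap of the two steps that precede the displacement step, which the tutor would then try to elicit.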

Authoring. High-level dialogue management is assumed or built into
the dialogue manager, but an instructor must author the lower-level
finite state network. Instructors use a scripting language (Jordan,
Rosé, & VanLehn 2001) to do so. The author must first define recipes
and their steps, define the initial answer class labels, assign
optional semantic labels to be used in implementing optional step and
difficulty level transitions, and indicate the difficulty levels for
each arc and which steps are optional. The reasking states,
transition conditions and arcs are generated automatically from the
authored network. Finally the author must define the answer classes
associated with the labels in the script. Answer classes are defined
differently for short answers and explanations, as described in more
detail in the next section.

Analyzing Student Contributions in Why2-Atlas

When a student contribution is to be analyzed, first an equation
identifier tags any physics equations in the student’s response and
then classification is done to complete the assessment of the
student’s natural language contributions. In the case of
explanations, the classification is with respect to steps in correct
and buggy chains of reasoning. All answer classes for explanation
states (including the initial response to the qualitative physics
problem) are selected from precomputed chains of reasoning. In the
case of short answers the classification is with respect to classes
that the author defines specifically for each state. Some of these
classes can be reused for other states, but such reuse is much less
frequent than with explanations. First we will describe how
explanations are classified and then short answers. Finally we will
briefly describe the equation identifier.

Explanation  Classification 

Explanation classification is broken into two stages: (1) single
sentence analysis, which outputs a first-order predicate logic (FOPL)
representation, and then (2) an assessment of correctness and
completeness of those representations with respect to nodes in
correct and buggy chains of reasoning. The nodes matched in this
final stage determine what classes are associated with the
explanation. First we will discuss single sentence analysis and then
the assessment of correctness and completeness.

Single Sentence Analysis. Single sentence analysis uses three
competing single sentence analysis methods and a heuristic selection
process to choose one of the output representations for each sentence
(Jordan, Makatchev, & VanLehn 2004). The rationale for using multiple
approaches is that the techniques available vary considerably in
accuracy, processing time and whether they tend to be brittle and
produce no analysis vs. a partial one. There is also a trade-off
between these performance measures and the amount of domain specific
setup required for each technique, and there are no formal return on
investment studies to give us insight into which technique is the
best one to pick for an application.

The first method, CARMEL, provides combined syntactic and semantic
analysis using the LCFlex syntactic parser along with semantic
constructor functions (Rosé 2000). Given a specification of the
desired representation language, it then maps the analysis to this
language. Then discourse level processing attempts to resolve nominal
and temporal anaphora and ellipsis to produce the final FOPL
representation for each sentence (Jordan & VanLehn 2002). Since the
knowledge engineering effort for creating semantic constructor
functions is considerable, there are gaps in the coverage of these
functions. Also there are known gaps in the discourse level
processing with respect to the WHY2-ATLAS domain.

The second method, RAINBOW, is a tool for developing bag of words
(BOW) text classifiers (McCallum & Nigam 1998). The classes of
interest must first be identified and then a text corpus annotated
with example sentences for each class. From this training data a bag
of words representation is derived for each class, and a number of
algorithms can be tried for measuring similarity of a new input
segment’s BOW representation to each class.

For WHY2-ATLAS, the classes we use are targeted nodes in the correct
and buggy chains of reasoning. But there were many misclassifications
of sentences due to overlap in the classes; that is, words that
discriminate between classes are shared by many other classes
(Pappuswamy et al. 2005). By aggregating classes and building three
tiers of BOW text classifiers that use a kNN measure, we obtained a
13% improvement in classification accuracy over a single classifier
approach (Pappuswamy et al. 2005). The first tier classification
identifies which second tier classifier to use, and likewise the
second tier classifier selects the third tier classifier. The third
tier then identifies which node, if any, a sentence expresses. But
even with these improvements, the current training data for
WHY2-ATLAS is too sparse for some classes to achieve good accuracy.
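The tiered arrangement can be sketched as a cascade in which each tier's output picks the classifier used at the next tier. The code below is a toy stand-in, not RAINBOW: the `KnnBow` class, the cosine similarity measure, the class names, and the training sentences are assumptions, and the "no node" outcome is omitted for brevity.

```python
import math
from collections import Counter

def bow(text):
    """Bag-of-words vector: word -> count."""
    return Counter(text.lower().split())

def cosine(a, b):
    dot = sum(a[w] * b[w] for w in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

class KnnBow:
    """k-nearest-neighbour classifier over bag-of-words vectors."""
    def __init__(self, examples, k=1):
        self.examples = [(bow(text), label) for text, label in examples]
        self.k = k

    def classify(self, text):
        query = bow(text)
        nearest = sorted(self.examples, key=lambda e: cosine(query, e[0]), reverse=True)[: self.k]
        return Counter(label for _, label in nearest).most_common(1)[0][0]

def cascade(text, tier1, tier2, tier3):
    """Tier 1 selects the tier-2 classifier, which selects the tier-3 classifier."""
    group = tier1.classify(text)
    subgroup = tier2[group].classify(text)
    return tier3[subgroup].classify(text)

# Hypothetical aggregated classes and training sentences.
TIER1 = KnnBow([
    ("the only force is gravity", "forces"),
    ("force of gravity equals mass times g", "forces"),
    ("the keys and the man have the same acceleration", "motion"),
    ("they have the same velocity", "motion"),
])
TIER2 = {
    "forces": KnnBow([("only force is gravity", "force_kind"),
                      ("gravity equals mass times g", "force_size")]),
    "motion": KnnBow([("same acceleration", "accel"), ("same velocity", "veloc")]),
}
TIER3 = {
    "force_kind": KnnBow([("the only force on the keys is gravity", "node_1")]),
    "force_size": KnnBow([("the force of gravity is mass times g", "node_2")]),
    "accel": KnnBow([("the keys and the man have the same acceleration", "node_accel")]),
    "veloc": KnnBow([("the keys and the man have the same velocity", "node_veloc")]),
}
```

The point of the cascade is that each classifier only has to discriminate among a few classes at a time, which reduces the effect of vocabulary shared across many target nodes.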

With the BOW approach, an assessment of correctness and completeness
can be skipped since a BOW class equates to a targeted node. However,
a representation of the class is still needed by the single sentence
selection process described below. This representation is obtained by
looking up a stored translation of the node associated with the
identified class.

Finally, the third method, RAPPEL, is a hybrid approach that uses
symbolically-derived syntactic dependency features (obtained via
MINIPAR (Lin & Pantel 2001)) to train for classes that are defined at
the representation language level (Jordan, Makatchev, & VanLehn
2004). Each proposition in the representation language corresponds to
a template in RAPPEL. Each template has its own set of classes that
cover all possible ways in which the template’s slots could be
filled. A class indicates which slots in a particular proposition
template are filled with which constants. There is a one-to-one
correspondence between a filled template and an instance of a
proposition in the representation language. An exception is body
slots, which are handled by separate binary classifiers: one for
propositions involving one body and another for those involving two
bodies.

A separate classifier is trained for each template. For example,
there is a classifier that specializes in the velocity template and
another that specializes in the acceleration template. For the
WHY2-ATLAS domain, there are 27 templates and thus 27 classifiers.
Each classifier returns either a nil, which indicates that no form of
that proposition is present, or a class label that corresponds to one
of the possible completions of the template. Classifiers and classes
have been defined that cover the entire WHY2-ATLAS representation
language, but the training data is sparse relative to the number of
classes.

Next one of the three possible outputs of the single sentence
analyzers must be selected. The selection process is independent of
the single sentence analysis techniques used; it depends only on the
system’s FOPL representation language. Heuristics estimate whether a
resulting representation either over or under represents the sentence
by matching the root forms of the words in the natural language
sentence to the constants in the representation returned by each
method.
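A rough sketch of such a heuristic follows. The actual heuristics are not specified here, so the scoring rule (count of unmatched words plus unmatched constants), the crude suffix-stripping `root` function, the stop-word list, and the candidate representations are all assumptions made for illustration.

```python
# Assumed list of function words that never correspond to a constant.
STOP = {"the", "a", "is", "are", "of", "on", "and", "to", "it"}

def root(word):
    """Very crude root form: lowercase, strip punctuation and a common suffix."""
    word = word.lower().strip(".,!?")
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def mismatch(sentence, constants):
    """Number of content words with no constant plus constants with no word."""
    words = {root(w) for w in sentence.split()} - STOP
    consts = {root(c) for c in constants}
    under = len(words - consts)    # sentence content missing from the representation
    over = len(consts - words)     # representation content not supported by the sentence
    return under + over

def select(sentence, candidates):
    """Return the name of the method whose representation fits the sentence best."""
    return min(candidates, key=lambda name: mismatch(sentence, candidates[name]))
```

For the sentence "The keys accelerate downward." and three hypothetical candidate constant sets, the candidate mentioning keys, accelerate, and downward would be preferred over one that adds an unsupported constant or omits most of the sentence.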

If the selected representation is not a product of the multi-level
BOW approach, then the representation is assessed for correctness and
completeness, as described next. Recall that the multi-level BOW
approach directly identifies which targeted node in the chain of
reasoning a sentence represents.

Analyzing correctness and completeness. As the final step in
analyzing a student’s explanation, an assessment of correctness and
completeness is performed by matching the FOPL representations of the
student’s response to nodes of an augmented assumption-based truth
maintenance system (ATMS) (Makatchev & VanLehn 2005). An ATMS for
each physics problem is generated off-line. The ATMS compactly
represents the deductive closure of a problem’s givens with respect
to a set of both good and buggy physics rules. That is, each node in
the ATMS corresponds to a proposition that follows from a problem
statement. Each anticipated student misconception is treated as an
assumption (in the ATMS sense), and all conclusions that follow from
it are tagged with a label that includes it as well as any other
assumptions needed to derive that conclusion. This labelling allows
the ATMS to represent many interwoven deductive closures, each
depending on different misconceptions, without inconsistency. The
labels allow recovery of how a conclusion was reached. Thus a match
with a node containing a buggy assumption indicates that the student
has a common error or misconception and identifies which one it is.
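The role of the labels can be shown with a toy lookup. This is not the augmented ATMS of the system: the proposition strings, the misconception name, and the choice of the smallest environment as the diagnosis are hypothetical, loosely modelled on the student error in Figure 2.

```python
# Hypothetical ATMS fragment: proposition -> label (a list of environments).
# Each environment is the set of buggy assumptions needed for one derivation;
# the empty environment means the proposition follows from correct rules alone.
ATMS = {
    "accel(keys) = accel(man)": [frozenset()],
    "accel(man) > accel(keys)": [frozenset({"heavier_objects_fall_faster"})],
    "keys move up relative to man": [frozenset({"heavier_objects_fall_faster"})],
}

def diagnose(proposition, atms=ATMS):
    """Return the misconceptions behind a matched node.

    None        -> the proposition matches no node
    empty set   -> derivable without any buggy assumption (correct)
    non-empty   -> the assumptions of the smallest supporting environment
    """
    label = atms.get(proposition)
    if label is None:
        return None
    return set(min(label, key=len))
```

A student sentence matched to "keys move up relative to man" would thus be traced back to the assumed misconception that heavier objects fall faster, which tells the dialogue manager which remediation to push.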

Completeness in WHY2-ATLAS is relative to an informal two-column
proof generated by a domain expert. A human author should control
which proof is used for checking completeness, and it is probably
less work for an author to write an acceptable proof than to find one
in the ATMS. The informal proof for the problem in Figure 2 is shown
in Figure 5, where facts appear in the left column and justifications
that are physics principles appear in the right column.
Justifications are further categorized as vector equations (e.g.
<Average velocity = displacement / elapsed time>, in step (12) of the
proof), or qualitative rules (e.g. “so if average velocity and time
are the same, so is displacement” in step (12)). A two-column proof
is represented in the system as a directed graph in which nodes are
facts, vector equations, or qualitative rules that have been
translated to the FOPL representation language off-line. The single
sentence analyzer can be used to assist in this translation, but a
developer must still review and refine the result. The edges of the
graph represent the inference relations between the premise and
conclusion of modus ponens.

Matches of input representations against the ATMS and the two-column
proof (we collectively referred to these earlier as the correct and
buggy chains of reasoning) do not have to be exact. Further
flexibility in the matching process is provided by examining a
neighborhood of radius N (in terms of graph distance) from matched
nodes in the ATMS to determine whether it contains any of the nodes
of the two-column proof. This provides an estimate of the proximity
of a student’s utterance to nodes of the two-column proof. Additional
details on correctness and completeness analysis are provided in
(Makatchev & VanLehn 2005).
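The radius-N neighborhood test amounts to a depth-limited breadth-first search. In the sketch below the adjacency-list graph, the node names, and the function name are assumptions; the graph is treated as undirected by listing each edge in both directions.

```python
from collections import deque

def proof_nodes_within(graph, matched, proof_nodes, radius):
    """Map each proof node within `radius` edges of `matched` to its graph distance."""
    dist = {matched: 0}
    queue = deque([matched])
    while queue:
        node = queue.popleft()
        if dist[node] == radius:
            continue                      # do not expand beyond the radius
        for nxt in graph.get(node, ()):
            if nxt not in dist:
                dist[nxt] = dist[node] + 1
                queue.append(nxt)
    return {n: d for n, d in dist.items() if n in proof_nodes}

# Hypothetical chain of ATMS nodes a - b - c - d, where d is also a proof node.
CHAIN = {"a": ["b"], "b": ["a", "c"], "c": ["b", "d"], "d": ["c"]}
```

With this chain, a match at node `a` finds no proof node within radius 2 but finds `d` at distance 3, so the returned distance can serve as the proximity estimate.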

Step 1
  Fact: The only force on the keys and the man is the force of gravity
  Justification: Forces are either contact forces or the gravitational
  force

Step 2
  Fact: The magnitude of the force of gravity on the man and the keys
  is its mass times g
  Justification: The force of gravity on an object has a magnitude of
  its mass times g, where g is the gravitational acceleration

Step 10
  Fact: At every time interval, the keys and the man have the same
  final velocity
  Justification: <Acceleration = (final velocity - initial
  velocity)/elapsed time>, so for two objects, if the acceleration,
  initial velocity and time are the same, so is final velocity.

Step 11
  Fact: The man and the keys have the same average velocity while
  falling
  Justification: If acceleration is constant, then <average velocity =
  (vf+vi)/2>, so if two objects have the same vf and vi, then their
  average velocity is the same.

Step 12
  Fact: The keys and the man have the same displacements at all times
  Justification: <Average velocity = displacement / elapsed time>, so
  if average velocity and time are the same, so is displacement.

Step 13
  Fact: The keys and the man have the same initial vertical position
  Justification: given

Step 14
  Fact: The keys and the man have the same vertical position at all
  times
  Justification: <Displacement = difference in position>, so if the
  initial positions of two objects are the same and their
  displacements are the same, then so is their final position

Step 15
  Fact: The keys stay in front of the man’s face at all times

Figure 5: Part of the informal “proof” used in WHY2-ATLAS for the Elevator problem in Figure 2.


Short-answer classification

Short-answer classification is accomplished using the LCFlex flexible
left corner parser that is part of CARMEL (Rosé 2000) and a separate
semantic grammar for each
state in which a short answer response is expected, although some
rules may be shared by other states. The classes in each state
grammar correspond to the expected responses. For instance, if the
anticipated responses for a state are “down” and “up”, then the
semantic grammar would have two rules such as “state1_resp_class1 =>
down_class” and “state1_resp_class2 => up_class”, where down_class
and up_class are classes that may be shared by semantic grammars for
other states. The classes are further defined by rules such as
“down_class => ’down’ or ’downward’ or ’toward earth’”. Because the
LCFlex parser can skip words, it can find certain key words or
phrases in the student’s response even if they are surrounded by
extra words (e.g. “It is downward.”). Thus when the author scripts
the answer classes for a state, the author needs to list as many
phrasings as possible that have similar semantics but can omit words
that won’t help distinguish it from a phrase with different semantics
(e.g. “it” or “is”).
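The effect of such a state grammar can be approximated with keyword spotting. The sketch below is not LCFlex and does no parsing; it only mimics the word-skipping behaviour by searching for listed phrases anywhere in the response. The phrasings for `up_class` and the function name are assumptions, while the rule names follow the example above.

```python
import re

# Shared classes: class name -> phrasings ('upward' is an assumed phrasing).
CLASSES = {
    "down_class": ["down", "downward", "toward earth"],
    "up_class": ["up", "upward"],
}
# Grammar for one state: response class -> shared class.
STATE1 = {"state1_resp_class1": "down_class", "state1_resp_class2": "up_class"}

def classify_short_answer(response, state_rules, classes=CLASSES):
    """Return the first response class with a phrasing found in the response."""
    text = " ".join(re.findall(r"[a-z']+", response.lower()))
    for resp_class, shared in state_rules.items():
        for phrase in classes[shared]:
            # Whole-word match; words around the phrase are simply skipped.
            if re.search(rf"\b{re.escape(phrase)}\b", text):
                return resp_class
    return None   # unanticipated response
```

Here "It is downward." is classified as `state1_resp_class1` even though "it" and "is" appear in no rule, which is why the author can leave such words out of the phrasings.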

Equation  Identification 

Equations can be expressed in natural language (e.g. net force is the
mass times the acceleration), in algebraic form (e.g. f=ma), or in
natural language mixed with algebraic symbols (e.g. net force is ma).
The equation identifier tags each of these expressions in a student’s
input as a semantic unit. Since there is a small set of equations to
consider (twelve correct and seven buggy ones) it is feasible to
match directly against the representations of these equations. The
equation identifier does this matching by applying a series of
regular expressions before invocation of explanation or short-answer
classification. Both types of classification are tolerant of formulas
that have been replaced by tags since they can either skip unknown
words (CARMEL), treat them as nouns (RAPPEL), or be trained with text
that has been tagged for equations (RAPPEL and RAINBOW).
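As an illustration of the regular-expression approach, the sketch below tags the three phrasings of a single equation given as examples above. The pattern, the tag string `EQ_NEWTON2`, and the function name are assumptions; the system's actual expressions and tag format are not given here.

```python
import re

# Left- and right-hand sides of "net force = mass * acceleration" in
# natural language, algebraic, and mixed forms.
LHS = r"(?:net\s+force|f)"
RHS = (r"(?:ma|m\s*\*\s*a"
       r"|(?:the\s+)?mass\s+times\s+(?:the\s+)?acceleration"
       r"|mass\s*\*\s*acceleration)")
NEWTON2 = re.compile(rf"\b{LHS}\s*(?:=|is|equals)\s*{RHS}\b", re.IGNORECASE)

def tag_equations(text):
    """Replace each recognized equation with a single tag token."""
    return NEWTON2.sub("EQ_NEWTON2", text)
```

All three example phrasings collapse to the same token, so the later classifiers see one semantic unit regardless of how the student wrote the equation.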


System  Evaluation 

The system was evaluated in the context of testing the hypothesis
that even when content is equivalent, students who engage in more
interactive forms of instruction learn more. To test this hypothesis
we compared students who received human tutoring with students who
read a short text. WHY2-ATLAS and WHY2-AUTOTUTOR provided a third
type of condition that served as an interactive form of instruction
where the content is better controlled than with human tutoring. With
the computer tutors, only the same content covered in the text
condition can be presented. But if the system misinterprets any of a
student’s multi-sentential answers it may skip material covered in
the text that the student needs. In all conditions the students
solved four problems that require multi-sentential answers, one of
which is shown in Figure 2.

After conducting a number of experiments with different
subpopulations and adjustments in content and assessment materials,
we found that overall students learn and learn equally well in all
three types of conditions when the content is appropriate to the
level of the student (VanLehn et al. 2005). That is, the learning
gains for human tutoring and the content controlled text were the
same. Thus, learning gains alone for this experimental setup can only
reveal whether the computer tutors were the same or worse than the
text. A system could perform worse if it too frequently misinterprets
multi-sentential answers and skips material covered in the text that
a student may need.

For the version of WHY2-ATLAS we described, the learning gains were
the same on two of three different types of post-tests administered.
On multiple-choice and essay post-tests, there was no reliable
difference. However, on fill-in-the-blank post-tests, the WHY2-ATLAS
students scored higher than the text students (p=0.010;
F(1,74)=6.33), and this advantage persisted when the scores were
adjusted by factoring out pre-test scores in an ANCOVA (p=0.018;
F(1,72)=5.83). Although this difference was in the expected
direction, it was not accompanied by similar differences for the
other two post-tests. These learning measures show that, relative to
the text, the two systems’ overall performance at selecting content
is good. But since the dialogue strategies in the two systems are
different and selected relative to the understanding techniques used,
we next need to do a detailed corpus analysis of the language data
collected to track successes and failures of understanding and
dialogue strategy selection relative to knowledge components in the
post-test.

During an informal review of the WHY2-ATLAS corpus we saw that the
strategy of walking through a problem had a positive impact on
students who could explain little initially. But the impact of
eliciting missing pieces of an explanation was mixed and requires a
detailed corpus analysis. While similar to WHY2-AUTOTUTOR’s hints,
these elicitations first summarize the correct components of a
student’s explanation that lead up to a missing or incorrect
component. We expect these dialogues to be more cohesive, compared to
ones using decontextualized hints, because they use problem-solving
structure to present an integrated partial explanation.

Conclusion 

We described a tutoring system that explores deeper understanding
techniques for multi-sentential explanations and dialogue strategies
that depend on deeper understanding. Compared to a system that uses
shallower understanding techniques, there were no measurable
differences in overall learning. However, overall learning measures
do not adequately evaluate the utility of deeper understanding and
its associated dialogue strategies, since they assume that
understanding performance and strategy choices are correct. Thus our
next step will be a detailed corpus analysis that examines
correlations between student learning and system performance during
tutoring.